{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "25c9a82f-ea07-434c-a031-e844bc28279d",
   "metadata": {},
   "source": [
    "# Enhance TableGPT Agent with RAG\n",
    "\n",
    "While the [File Reading Workflow](../../explanation/file-reading) is adequate for most scenarios, it may not always provide the information necessary for the LLM to generate accurate code. Consider the following examples:\n",
    "\n",
    "- A categorical column in the dataset contains 'foo', 'bar', and 'baz', but 'baz' only appears after approximately 100 rows. In this case, the LLM may not encounter the 'baz' value through `df.head()`.\n",
    "- The user's query may not align with the dataset's content for several reasons:\n",
    "  - The dataset lacks proper governance. For instance, a cell value might be misspelled from 'foo' to 'fou'.\n",
    "  - There could be a typo in the user's query. For example, if the user queries, \"Show me the data for 'fou',\" but the dataset contains 'foo' instead.\n",
    "\n",
    "In such situations, the Dataset Retriever plugin can be utilized to fetch additional information about the dataset from external sources, thereby providing the LLM with more context and improving its ability to generate accurate responses."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b4581e28-08df-4674-9a0f-18d79e2b3c1d",
   "metadata": {},
   "source": [
    "## Quick Start\n",
    "\n",
    "To help you quickly integrate and utilize `RAG` with the `TableGPT Agent`, follow the steps outlined in this section. These instructions will guide you through the process of loading datasets, enhancing retrieval with document compression, and integrating with a powerful LLM-based agent. By the end of this quick start, you'll be able to issue complex queries and receive enriched, context-aware responses.\n",
    "\n",
    "### Step 1: Install Required Dependencies\n",
    "To get started with using RAG in the TableGPT Agent, you need to install the necessary dependencies. The primary package required is langchain, which facilitates building retrieval-augmented workflows.\n",
    "\n",
    "Run the following command to install it:\n",
    "\n",
    "```sh\n",
    "pip install langchain\n",
    "```\n",
    "\n",
    "### Step 2: Load and Prepare Data with CSVLoader\n",
    "\n",
    "The `TableGPT Agent` provides a convenient `CSVLoader` for converting `CSV` or `Excel` files into a format that can be processed by the RAG pipeline. This method allows seamless integration of your data for further retrieval and embedding.\n",
    "\n",
    "**Example Code:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f0212e29-0de9-487b-a555-8eaf65c519ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.vectorstores import InMemoryVectorStore\n",
    "from tablegpt.retriever import CSVLoader\n",
    "\n",
    "loader = CSVLoader(\"产品销量表.csv\", autodetect_encoding=True)\n",
    "\n",
    "documents = []\n",
    "async for item in loader.alazy_load():\n",
    "    documents.append(item)\n",
    "\n",
    "# Initialize with an embedding model\n",
    "vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())\n",
    "\n",
    "await vector_store.aadd_documents(documents=documents)\n",
    "dataset_base_retriever = vector_store.as_retriever()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6200c96",
   "metadata": {},
   "source": [
    "### Step 3: Build a Context-Aware Retriever with Document Compression\n",
    "\n",
    "To enhance the retrieval process, `langchain` provides powerful retriever utilities that can be combined with custom compressors. In this step, we utilize the `ColumnDocCompressor` from tablegpt to focus on relevant columns and build an efficient `dataset_retriever`.\n",
    "\n",
    "**Example Code:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "5d253ade-5c85-404e-9b77-88c6e3e5cb9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.retrievers import ContextualCompressionRetriever\n",
    "from langchain.retrievers.document_compressors import DocumentCompressorPipeline\n",
    "from tablegpt.retriever import ColumnDocCompressor\n",
    "\n",
    "dataset_compressor = DocumentCompressorPipeline(\n",
    "    transformers=[ColumnDocCompressor()]\n",
    ")\n",
    "\n",
    "dataset_retriever = ContextualCompressionRetriever(\n",
    "    base_compressor=dataset_compressor,\n",
    "    base_retriever=dataset_base_retriever,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "169dc7b6-e732-43bf-a9fb-a802719cc0f4",
   "metadata": {},
   "source": [
    "### Step 4: Integrate with TableGPT Agent\n",
    "\n",
    "In this step, we integrate the `dataset_retriever` with the `TableGPT Agent` using an `LLM` and a local execution environment. This setup ensures that the agent can handle user queries effectively by leveraging both the LLM and retrieved dataset context.\n",
    "\n",
    "**Example Code:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "188480c4-fbb2-4a81-b18e-1fc587db4b8e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_openai import ChatOpenAI\n",
    "from pybox import LocalPyBoxManager\n",
    "from tablegpt.agent import create_tablegpt_graph\n",
    "from tablegpt import DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR\n",
    "\n",
    "llm = ChatOpenAI(openai_api_base=\"YOUR_VLLM_URL\", openai_api_key=\"whatever\", model_name=\"TableGPT2-7B\")\n",
    "pybox_manager = LocalPyBoxManager(profile_dir=DEFAULT_TABLEGPT_IPYKERNEL_PROFILE_DIR)\n",
    "\n",
    "agent = create_tablegpt_graph(\n",
    "    llm=llm,\n",
    "    pybox_manager=pybox_manager,\n",
    "    dataset_retriever=dataset_retriever,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0d8db97",
   "metadata": {},
   "source": [
    "With this setup, your `TableGPT Agent` is ready to process user queries, retrieve relevant data, and generate contextually accurate responses. The integration of RAG techniques ensures that the agent leverages external data effectively, providing enhanced insights and performance.\n",
    "\n",
    "\n",
    "### Step 5: Analyze Data with the TableGPT Agent\n",
    "\n",
    "Finally, you can use the `TableGPT Agent` to perform analysis by sending a query. The response can help determine whether retrieval-augmented generation (RAG) has provided enhanced results. Observing the returned information allows you to assess the accuracy and completeness of the generated response.\n",
    "\n",
    "**Example Code:**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "de23ac53-5ec6-4684-932d-dc75a2d67255",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[HumanMessage(content='桃酥的销售量是多少?', additional_kwargs={}, response_metadata={}, id='b567e1c3-8943-453c-9ebe-fa8d34cfc388'),\n",
       " SystemMessage(content='\\nHere are some extra column information that might help you understand the dataset:\\n- 产品销量表.csv:\\n  - {\"column\": 名称, \"dtype\": \"string\", \"values\": [\"花生桃酥\", ...]}\\n  - {\"column\":  销售额 , \"dtype\": \"string\", \"values\": [\" ￥931,000.00 \", \" ￥225,060.00 \", \" ￥58,500.00 \", ...]}\\n', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='07fdddf4-05e8-4022-9a78-98ee3744aab2'),\n",
       " AIMessage(content=\"为了回答这个问题，我们首先需要读取文件`产品销量表.csv`，然后找到列名包含“名称”和“销售额”的列，特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。\\n```python\\nimport pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\\n```\", additional_kwargs={'thought': '为了回答这个问题，我们首先需要读取文件`产品销量表.csv`，然后找到列名包含“名称”和“销售额”的列，特别是需要找到“花生桃酥”的销售量。让我们先读取数据并查看前几行。', 'action': {'tool': 'python', 'tool_input': \"import pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\"}, 'parent_id': 'some-parent-id'}, response_metadata={}, id='27da6f10-2201-4349-bc23-9f7b42f34742', tool_calls=[{'name': 'python', 'args': {'query': \"import pandas as pd\\n\\n# 读取数据\\ndf = read_df(uri='产品销量表.csv')\\n\\n# 显示数据框的前几行\\ndf.head()\"}, 'id': 'be9a29de-7f5d-4010-a85b-37286ab99e86', 'type': 'tool_call'}]),\n",
       " ToolMessage(content=[{'type': 'text', 'text': '```pycon\\n       编号      名称 单位   单价（元）      销售量             销售额 \\n0  mb2033    法式面包  包   ￥7.40   305080   ￥2,257,592.00 \\n1  mb2034    奶昔蛋糕  包   ￥5.80    93200     ￥540,560.00 \\n2  mb2035  奶油夹心饼干  包   ￥3.10   215300     ￥667,430.00 \\n3  mb2036     葱油饼  包   ￥2.20   102300     ￥225,060.00 \\n4  mb2037    花生桃酥  包   ￥3.80   130000     ￥494,000.00 \\n```'}], name='python', id='a48d70fd-2e01-48ee-a9a5-25dc0eec04d6', tool_call_id='be9a29de-7f5d-4010-a85b-37286ab99e86', artifact=[]),\n",
       " AIMessage(content='从数据中我们可以看到，“花生桃酥”的销售量为130,000包。', additional_kwargs={'parent_id': 'some-parent-id'}, response_metadata={}, id='5c5b703d-2eea-444b-a627-0828dca06df2')]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from datetime import date\n",
    "from langchain_core.messages import HumanMessage\n",
    "\n",
    "message = HumanMessage(content=\"桃酥的销售量是多少?\")\n",
    "\n",
    "_input = {\n",
    "    \"messages\": [message],\n",
    "    \"parent_id\": \"some-parent-id\",\n",
    "    \"date\": date.today(),\n",
    "}\n",
    "\n",
    "response = await agent.ainvoke(_input)\n",
    "\n",
    "response[\"messages\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "167c23b6",
   "metadata": {},
   "source": [
    "**Output:**\n",
    "\n",
    "> Here are some extra column information that might help you understand the dataset:\n",
    "> - 产品销量表.csv:\n",
    ">  - {\"column\": 名称, \"dtype\": \"string\", \"values\": [\"花生桃酥\", ...]}\n",
    ">  - {\"column\":  销售额 , \"dtype\": \"string\", \"values\": [\" ￥931,000.00 \", \" ￥225,060.00 \", \" ￥58,500.00 \", ...]}\n",
    "\n",
    "The output confirms that the RAG approach effectively enriches the agent's responses by incorporating dataset context. This improvement allows the agent to provide detailed, actionable insights rather than generic answers, thereby enhancing its utility for complex queries."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
